
fix: clear stale pod-name annotation instead of hard error #521

Open
noeljackson wants to merge 1 commit into kubernetes-sigs:main from noeljackson:pr/fix-stale-pod-annotation


@noeljackson
Contributor

Summary

When the pod tracked by the agents.x-k8s.io/pod-name annotation no longer exists, clear the stale annotation and fall through to pod creation instead of returning a hard error.

Problem

The ensurePodNameAnnotation function (commit 32cddd3) records the backing pod's name on the Sandbox CR. This is used for stable pod tracking across reconciliations. However, when the annotated pod is deleted (warm pool rotation, eviction, image pull failure), reconcilePod returns a hard error:

if podNameAnnotationExists {
    log.Error(err, "Pod not found")
    return nil, fmt.Errorf("pod in annotation get failed: %w", err)
}

The controller never reaches PATH 3 (create pod). The Sandbox is stuck in a reconcile error loop and the warm pool never becomes ready.

Fix

When the annotated pod isn't found, clear the stale annotation and let pod = nil fall through to pod creation:

if podNameAnnotationExists {
    log.Info("Tracked pod not found, clearing stale annotation", "podName", podName)
    patch := client.MergeFrom(sandbox.DeepCopy())
    delete(sandbox.Annotations, sandboxv1alpha1.SandboxPodNameAnnotation)
    if patchErr := r.Patch(ctx, sandbox, patch); patchErr != nil {
        return nil, fmt.Errorf("failed to clear stale pod name annotation: %w", patchErr)
    }
}

The subsequent ensurePodNameAnnotation call after pod creation re-sets the annotation to track the new pod.
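The resulting control flow can be sketched with an in-memory model. The `sandbox` and `cluster` types, the `reconcilePod` signature, and the use of the sandbox name as the replacement pod's name are assumptions made for illustration; the real controller works against the Kubernetes API and patches the Sandbox CR rather than mutating a struct.

```go
package main

import "fmt"

// podNameAnnotation is the annotation key named in the PR description.
const podNameAnnotation = "agents.x-k8s.io/pod-name"

// sandbox is a hypothetical stand-in for the Sandbox CR.
type sandbox struct {
	Name        string
	Annotations map[string]string
}

// cluster is a hypothetical in-memory substitute for the pod store.
type cluster struct {
	pods map[string]bool
}

// reconcilePod returns the name of the pod backing sb, creating one if needed.
func reconcilePod(c *cluster, sb *sandbox) string {
	if tracked, ok := sb.Annotations[podNameAnnotation]; ok {
		if c.pods[tracked] {
			return tracked // annotation is valid: nothing to do
		}
		// Stale annotation: clear it and fall through to creation.
		delete(sb.Annotations, podNameAnnotation)
	}

	// Create path. Naming the pod after the sandbox is an assumption here.
	podName := sb.Name
	c.pods[podName] = true

	// Re-record the tracked pod, mirroring the ensurePodNameAnnotation step.
	if sb.Annotations == nil {
		sb.Annotations = map[string]string{}
	}
	sb.Annotations[podNameAnnotation] = podName
	return podName
}

func main() {
	c := &cluster{pods: map[string]bool{}}
	sb := &sandbox{
		Name:        "demo",
		Annotations: map[string]string{podNameAnnotation: "evicted-pod"},
	}
	fmt.Println("backing pod:", reconcilePod(c, sb))
	fmt.Println("annotation: ", sb.Annotations[podNameAnnotation])
}
```

A second call to `reconcilePod` with the same inputs takes the early return, so the sketch is idempotent once the replacement pod exists.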

Test plan

  • TestReconcilePodClearsStaleAnnotation — sandbox with stale annotation pointing to non-existent pod creates a new pod and updates the annotation
  • Updated table test to remove the old hard-error expectation
  • All existing reconcilePod tests pass (no behavior change for valid annotations)

When the pod tracked by agents.x-k8s.io/pod-name doesn't exist
(deleted during warm pool rotation, eviction, or image pull failure),
the controller returned a hard error, leaving the Sandbox stuck in a
reconcile loop unable to create a replacement pod.

Now the controller clears the stale annotation and falls through to
pod creation. The new pod gets tracked via ensurePodNameAnnotation.
@netlify

netlify bot commented Apr 3, 2026

Deploy Preview for agent-sandbox canceled.

🔨 Latest commit: 18afe68
🔍 Latest deploy log: https://app.netlify.com/projects/agent-sandbox/deploys/69cff02a312f860008cb9c95

@k8s-ci-robot requested review from barney-s and soltysh on Apr 3, 2026 16:51
@k8s-ci-robot added the label cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) on Apr 3, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: noeljackson
Once this PR has been reviewed and has the lgtm label, please assign justinsb for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Hi @noeljackson. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the labels needs-ok-to-test (indicates a PR that requires an org member to verify it is safe to test) and size/M (denotes a PR that changes 30-99 lines, ignoring generated files) on Apr 3, 2026
noeljackson added a commit to noeljackson/agent-sandbox that referenced this pull request Apr 3, 2026
@aditya-shantanu
Contributor

/ok-to-test

@k8s-ci-robot added the label ok-to-test (indicates a non-member PR verified by an org member that is safe to test) and removed the label needs-ok-to-test (indicates a PR that requires an org member to verify it is safe to test) on Apr 7, 2026
@k8s-ci-robot
Contributor

@noeljackson: The following test failed. Say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: presubmit-agent-sandbox-e2e-test
Commit: 18afe68
Details: link
Required: true
Rerun command: /test presubmit-agent-sandbox-e2e-test

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


@codebot-robot codebot-robot left a comment


Overall, this PR provides an excellent and solid fix for resolving the nasty edge case of a stale pod annotation loop that blocks Sandbox reconciliation when a tracked backing Pod is deleted out-of-band (e.g., evicted or rotated from the warm pool). The approach of proactively clearing the stale annotation and falling back to standard pod creation is safe, reliable, and aligns well with standard Kubernetes controller patterns.

The PR handles object mutation correctly by deep-copying the cached object before modifying it, so that client.MergeFrom can compute a JSON merge patch against the unmodified copy, and the tests verify both the new Pod creation and the final updated state of the Sandbox's annotations.

I've left a few minor inline comments pointing out some good practices used here, validating the architectural approach with the explicit patch, and suggesting enhancements for observability, stronger error context, and expanded test coverage (specifically ensuring that client.MergeFrom correctly preserves unrelated annotations). No blocking issues found. Great work on this fix!

(This review was generated by Overseer)

vamsi-resolve added a commit to clouddatalabs/agent-sandbox that referenced this pull request Apr 8, 2026
Cherry-picks two upstream fixes:

1. kubernetes-sigs#521 — When an adopted warm pool pod is
   deleted (node failure, drain, eviction), the controller returned a hard
   error because the agents.x-k8s.io/pod-name annotation pointed to a
   non-existent pod. This left the Sandbox stuck in a permanent reconcile
   error loop. Now the controller clears the stale annotation and falls
   through to create a replacement pod (which remounts the existing PVC).

2. kubernetes-sigs#469 — During warm pool adoption, ensure
   the pod-name annotation is correct before the sandbox can be observed
   as Ready. Prevents stale annotations from being set in the first place.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
